The Shiny app will be self-contained. To avoid cluttering it, this R Markdown document was created for further exploration of the data.
Some findings could invert over time. I tried to make the write-up as flexible as possible with inline code chunks, but–as more data gets added–the results from these tests will change. To see the results from when these tests were first performed, change the input to read_csv() so that it reads in the csv file found in the “Analysis” folder.
This was written under R version 4.0.3.
The La Junta dataset is quickly becoming one of my favorites. I suppose there is something to be said about collecting, cleaning, and analyzing data all by yourself.
if(!require(pacman)) install.packages("pacman")
# General Packages
pacman::p_load(dplyr, readr, ggplot2,
patchwork, lubridate,
knitr, broom, stringr,
plotly)
# Imputation Packages
pacman::p_load(VIM)
# KNN Packages
pacman::p_load(rsample, class)
# Logistic Regression Packages
pacman::p_load(pscl, caret, car,
InformationValue)
# LDA Packages
pacman::p_load(MASS)
# SVM Packages
pacman::p_load(e1071)
# The raw data for La Junta, which is on GitHub
# (as opposed to the local file):
lajunta <- read_csv(paste0("https://raw.githubusercontent.com/",
"Ckrenzer/Winter-Livestock-Data/main/",
"La%20Junta%20Market%20Reports.csv"),
col_types = cols(Date = "D",
Buyer = col_character(),
Quantity = col_double(),
Type = col_character(),
Weight = col_double(),
Price = col_double(),
Reprod = col_character()))
# Know that I only used paste0() on the url
# to avoid getting into that dark LaTeX magic
# for pdf output!
# Removing the URL column
# Remove missing values for Weight and Price--missing values
# for Reprod will be imputed.
# Missing values for Date and Quantity will be neither imputed nor filtered out.
lajunta <- lajunta %>%
dplyr::select(-URL) %>%
filter(!is.na(Weight),
!is.na(Price))
To familiarize you with the dataset, let’s take a quick peek:
| Date | Quantity | Type | Weight | Price | Reprod |
|---|---|---|---|---|---|
| 2016-01-05 | 10 | ang | 410 | 220 | str |
| 2016-01-05 | 24 | ang | 490 | 213 | str |
| 2016-01-05 | 5 | ang | 400 | 197 | hfr |
| 2016-01-05 | 3 | lim | 475 | 214 | str |
| 2016-01-05 | 4 | lim | 525 | 205 | str |
| 2016-01-05 | 3 | lim | 430 | 185 | hfr |
| 2016-01-05 | 11 | clr | 485 | 210 | str |
| 2016-01-05 | 11 | clr | 450 | 180 | hfr |
You can see that we have data for the date of the sale, the quantity sold, the type of livestock, the weight in pounds, the price in USD, and the livestock’s reproductive status. There is also data on Buyer names, but I thought it would be rude to include that information in this table. It hasn’t proved very useful anyway.
From my previous work (see “La Junta Predictions.Rmd,” for instance), I can confidently say that the “Type” column is utterly useless. That’s a story in and of itself, I suppose. The simplest graph that tells the bulk of what we need to know is this one:
The dark theme on ggplot looks really nice, don’t you think? We can clearly see four distinct groups and will explore these relationships in various ways.
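The chunk that generates the figure isn't echoed here, but a sketch along these lines would reproduce the idea (the alpha level and exact aesthetics are my assumptions about the original plot):

```r
# Weight vs. Price, colored by reproductive status--the four
# clusters (bull, cow, heifer, steer) are visible at a glance
ggplot(lajunta, aes(x = Weight, y = Price, color = Reprod)) +
  geom_point(alpha = 0.4) +
  theme_dark()
```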
Some of the data contains missing values. I will not impute missing data for Date, but the Reprod column can be reasonably estimated. The next section will walk through KNN in-depth, but just know that it will be used to impute missing values. The VIM package is used to perform this algorithm.
lajunta <- VIM::kNN(data = lajunta,
variable = "Reprod",
dist_var = c("Price", "Weight", "Quantity"))
Since the data clusters so nicely, I could not resist checking KNN’s performance!
I will be using the rsample package to split the data into testing and training sets. The class package will be used for the KNN algorithm. With this algorithm, I will be classifying cattle reproductive status (cow, bull, steer, or heifer) using weight and market price.
Lucky for me, I already spent a full day cleaning this data! All I have to do is select relevant columns, make the character column a factor, and standardize the numeric columns.
There are only two predictors in this model, weight and price.
# Fortunately, KNN is well-suited for just a few predictors
lajunta_knn <- lajunta %>%
dplyr::select(Weight, Price, Reprod)
# storing the string vector as a factor
lajunta_knn$Reprod <- as.factor(lajunta_knn$Reprod)
Since KNN is sensitive to different scales, the values of the predictors need to be standardized.
# The normalizing function, which transforms
# all values into the range from 0 to 1
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
# Applying normalize() on the predictor
# columns
lajunta_knn <- lajunta_knn %>%
mutate(across(where(is.numeric), normalize))
A stratified, simple random sampling technique will be used to separate values into training and testing data. Stratification is necessary in this instance because there are far fewer cows and bulls in the dataset than there are heifers or steers. Eighty percent of the values will be placed in the training set and the remaining twenty percent will go to the test set.
# Setting the random number stream
set.seed(2021)
lajunta_knn_split <- rsample::initial_split(lajunta_knn,
prop = 0.80,
strata = Reprod)
# As of 2/6/2021, the split is 869 in train
# and 289 in test, with a total of 1158 observations
# separating the data into two new data frames
lajunta_knn_train <- rsample::training(lajunta_knn_split)
lajunta_knn_test <- rsample::testing(lajunta_knn_split)
To determine the appropriate number for K, I took the square root of the number of observations in the training set.
# Determining the appropriate value of K
optimal_k <- floor(sqrt(nrow(lajunta_knn_train)))
cat("Optimal value of K:", optimal_k)
## Optimal value of K: 115
# To explain the variable names, know that the
# value of optimal_k is 29 (and some change)
# at the time of writing
# the cl argument is the factor of the true
# classifications from the training set
knn_29 <- class::knn(train = lajunta_knn_train[1:2],
test = lajunta_knn_test[1:2],
cl = lajunta_knn_train$Reprod,
k = optimal_k)
To calculate the accuracy, the data was thrown into a confusion matrix and the number of correct classifications was calculated as a rate. Though not shown, all cows and bulls are predicted perfectly. The only data which are not classified correctly are steers and heifers.
# Create a confusion matrix--The result from knn()
# and the test data dependent variable are the
# arguments
knn_29_mat <- table(knn_29, lajunta_knn_test$Reprod)
# This function measures accuracy.
# It takes the confusion matrix as input.
accuracy <- function(x){
return(sum(diag(x)) / sum(x) * 100)
}
cat("Accuracy", accuracy(knn_29_mat)) # 87.2% as of 2/6/2021
## Accuracy 78.9615
We can always do better, no? Let’s run this again in a loop over changing values of K. The K with the greatest accuracy should be used for all future predictions. Also, this value should be somewhat close to 115–if the best K is much smaller than 115, it is likely that we are overfitting the model. That pesky bias-variance trade-off…
# Since I am using a loop, best practice
# is to pre-allocate space in the vector
acc <- numeric(100)
for(i in 1:100){
knn_mod <- class::knn(train = lajunta_knn_train[1:2],
test = lajunta_knn_test[1:2],
cl = lajunta_knn_train$Reprod,
k = i)
acc[i] <- accuracy(table(knn_mod, lajunta_knn_test$Reprod))
}
The ideal value was 70–very close to 115! Using just two variables, we can predict the reproductive status of livestock with 79.25 percent accuracy! As previously stated, most of these discrepancies are between steers and heifers. If we are only interested in cows and bulls, we will virtually always have the right prediction!
I am going to repeat this process with Quantity added to the model. Will the accuracy improve?
The accuracy after adding Quantity into the model at the optimal value of K is 78.98 percent–slightly worse than the model with only two variables. More predictors are not always better!
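The three-predictor run isn't shown above, but it follows the same recipe, with Quantity normalized alongside Weight and Price (a sketch; the object names here are my own):

```r
# KNN with Quantity added--columns 1:3 are now the predictors
lajunta_knn3 <- lajunta %>%
  dplyr::select(Weight, Price, Quantity, Reprod) %>%
  mutate(across(where(is.numeric), normalize),
         Reprod = as.factor(Reprod))

set.seed(2021)
lajunta_knn3_split <- rsample::initial_split(lajunta_knn3,
                                             prop = 0.80,
                                             strata = Reprod)
lajunta_knn3_train <- rsample::training(lajunta_knn3_split)
lajunta_knn3_test  <- rsample::testing(lajunta_knn3_split)

knn3_mod <- class::knn(train = lajunta_knn3_train[1:3],
                       test  = lajunta_knn3_test[1:3],
                       cl    = lajunta_knn3_train$Reprod,
                       k     = optimal_k)
accuracy(table(knn3_mod, lajunta_knn3_test$Reprod))
```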
The data appears linear, so linear models seem appropriate for this dataset. There is a linear model for each reproductive status. You can see each model’s parameters below:
# All results are from 2/6/2021...
steer_fit <- lm(data = lajunta,
subset = Reprod == "str",
formula = Price ~ Date +
Quantity + poly(Weight, degree = 2))
heifer_fit <- lm(data = lajunta,
subset = Reprod == "hfr",
formula = Price ~ Date +
Quantity + poly(Weight, degree = 2))
# a 12th degree polynomial gave a higher
# R-squared, but I think that is due to overfitting
cow_fit <- lm(data = lajunta,
subset = Reprod == "cow",
formula = Price ~ Date +
poly(Weight, 3))
# a 6th degree polynomial gave a higher
# R-squared, but I think that is due to overfitting...
# Our sample is small so we want to keep the
# number of predictors small
bull_fit <- lm(data = lajunta,
subset = Reprod == "bull",
formula = Price ~ Date +
poly(Weight, 2))
Now, let’s see how these models performed:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 237.6801356 | 5.7391025 | 41.414165 | 0.0000000 |
| Date | -0.0042055 | 0.0003147 | -13.363495 | 0.0000000 |
| Quantity | 0.0496305 | 0.0076020 | 6.528608 | 0.0000000 |
| poly(Weight, degree = 2)1 | 764.1255635 | 384.2806445 | 1.988457 | 0.0467966 |
| poly(Weight, degree = 2)2 | 3989.7837010 | 184.8862644 | 21.579665 | 0.0000000 |
## R-squared: 0.5798677
## Adjusted R-squared: 0.5796489
## RSE: 15.84274
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 264.3753190 | 8.7653938 | 30.161260 | 0 |
| Date | -0.0057135 | 0.0004752 | -12.024470 | 0 |
| Quantity | 0.0851730 | 0.0124105 | 6.862962 | 0 |
| poly(Weight, degree = 2)1 | 6459.9550740 | 685.2010416 | 9.427824 | 0 |
| poly(Weight, degree = 2)2 | 5335.2261148 | 316.4796829 | 16.858037 | 0 |
## R-squared: 0.2492122
## Adjusted R-squared: 0.2487494
## RSE: 21.66713
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 252.8672587 | 23.563099 | 10.7314941 | 0.0000000 |
| Date | -0.0102592 | 0.001255 | -8.1747910 | 0.0000000 |
| poly(Weight, 3)1 | -224.7639870 | 680.926256 | -0.3300857 | 0.7413635 |
| poly(Weight, 3)2 | 325.2531155 | 391.011291 | 0.8318254 | 0.4055888 |
| poly(Weight, 3)3 | 89.5952590 | 353.504705 | 0.2534486 | 0.7999429 |
## R-squared: 0.02720644
## Adjusted R-squared: 0.0256143
## RSE: 34.46267
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 228.5792740 | 8.3657631 | 27.32318 | 0 |
| Date | -0.0047463 | 0.0004726 | -10.04247 | 0 |
| poly(Weight, 2)1 | -3582.7868927 | 43.5314255 | -82.30346 | 0 |
| poly(Weight, 2)2 | 1148.4688335 | 24.2490141 | 47.36146 | 0 |
## R-squared: 0.8487049
## Adjusted R-squared: 0.8483401
## RSE: 9.078266
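The coefficient tables and fit statistics above were presumably produced along these lines (a sketch using broom and knitr, both loaded earlier; the exact calls are an assumption):

```r
# Coefficient table and fit statistics for one model;
# repeat for heifer_fit, cow_fit, and bull_fit
knitr::kable(broom::tidy(steer_fit))
steer_glance <- broom::glance(steer_fit)
cat("R-squared:", steer_glance$r.squared, "\n")
cat("Adjusted R-squared:", steer_glance$adj.r.squared, "\n")
cat("RSE:", steer_glance$sigma, "\n")
```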
Linear regression is an art as much as it is a science. I could write a 30-page paper explaining all the EDA, providing a play-by-play commentary about potential improvements to this model. But today I think I’ll leave things as they are presented here.
See the accompanying Shiny app (“La Junta Predictions.Rmd,” also linked above) if you’d like to play with these linear models yourself!
There are a few things I want to try with a logistic regression. Though extensions of logistic regression can handle more than two categories, it is best-suited for binary dependent variables. Therefore, I will be evaluating three models with binary outcomes.
Due to enormous differences between cows/bulls and heifers/steers, I expect zero error in the third regression.
First things first. In the first regression, only data for steers and heifers will be included. In the second regression, only data for bulls and cows will be included. In the third regression, steers and heifers will be recoded as “y,” for “young,” while cows and bulls will be recoded as “par,” for “parents” (and yes, I understand that bulls are not necessarily fathers; I had trouble coming up with a good name for these categories). With this in mind, let’s store data for the models:
# Data for regression 1
lajunta_logit1 <- lajunta %>%
filter(Reprod %in% c("str", "hfr")) %>%
mutate(across(where(is.character), as.factor))
# Data for regression 2
lajunta_logit2 <- lajunta %>%
filter(Reprod %in% c("bull", "cow")) %>%
mutate(across(where(is.character), as.factor))
# Data for regression 3
lajunta_logit3 <- lajunta %>%
mutate(Reprod = Reprod %>%
str_replace_all("bull|cow", "par") %>%
str_replace_all("str|hfr", "y")) %>%
mutate(across(where(is.character), as.factor))
Now that we have everything in a tibble, we are ready to split up the data.
The data will be split into testing and training sets with an 80/20 split. For the same reasons used in the KNN classification section, stratified sampling will be used.
# REGRESSION 1 (steer vs heifer)
# Setting the random number stream
set.seed(2021)
# Stratification for the first regression
lajunta_logit1_split <- rsample::initial_split(lajunta_logit1,
prop = 0.80,
strata = Reprod)
# separating the data into two new data frames
lajunta_logit1_train <- rsample::training(lajunta_logit1_split)
lajunta_logit1_test <- rsample::testing(lajunta_logit1_split)
# REGRESSION 2 (bull vs cow)
# Setting the random number stream again
# (in the event R tries to pull some
# funny business)
set.seed(2021)
# Stratification for the second regression
lajunta_logit2_split <- rsample::initial_split(lajunta_logit2,
prop = 0.80,
strata = Reprod)
# separating the data into two new data frames
lajunta_logit2_train <- rsample::training(lajunta_logit2_split)
lajunta_logit2_test <- rsample::testing(lajunta_logit2_split)
# REGRESSION 3 (bull/cow vs. steer/heifer)
# Setting the random number stream again
# (in the event R tries to pull some
# funny business)
set.seed(2021)
# Stratification for the third regression
lajunta_logit3_split <- rsample::initial_split(lajunta_logit3,
prop = 0.80,
strata = Reprod)
# separating the data into two new data frames
lajunta_logit3_train <- rsample::training(lajunta_logit3_split)
lajunta_logit3_test <- rsample::testing(lajunta_logit3_split)
The data is now ready for model assembly. I will be using the stats package’s glm() function as the engine for the regressions.
# REGRESSION 1 (steer vs heifer)
# Quantity is not very different between steers and heifers
logit1 <- glm(data = lajunta_logit1_train,
formula = Reprod ~ Weight + Price,
family = "binomial")
# REGRESSION 2 (bull vs cow)
logit2 <- glm(data = lajunta_logit2_train,
formula = Reprod ~ Weight + Price + Quantity,
family = "binomial")
# REGRESSION 3 (bull/cow vs. steer/heifer)
logit3 <- glm(data = lajunta_logit3_train,
formula = Reprod ~ Weight + Price + Quantity,
family = "binomial")
Here are the results from the first regression (steers vs heifers):
##
## Call:
## glm(formula = Reprod ~ Weight + Price, family = "binomial", data = lajunta_logit1_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6479 -0.9118 0.3714 0.8513 2.7400
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.756e+01 3.851e-01 -45.60 <2e-16 ***
## Weight 9.820e-03 2.553e-04 38.47 <2e-16 ***
## Price 7.821e-02 1.716e-03 45.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14669 on 10635 degrees of freedom
## Residual deviance: 11386 on 10633 degrees of freedom
## AIC: 11392
##
## Number of Fisher Scoring iterations: 4
From the second regression (bull vs cow):
##
## Call:
## glm(formula = Reprod ~ Weight + Price + Quantity, family = "binomial",
## data = lajunta_logit2_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4311 -0.1328 0.0933 0.2344 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 22.4200114 0.9133731 24.546 <2e-16 ***
## Weight -0.0078604 0.0003671 -21.410 <2e-16 ***
## Price -0.1259654 0.0057170 -22.033 <2e-16 ***
## Quantity 0.0702418 0.0363685 1.931 0.0534 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3546.06 on 2772 degrees of freedom
## Residual deviance: 667.32 on 2769 degrees of freedom
## AIC: 675.32
##
## Number of Fisher Scoring iterations: 9
And the third regression (bull/cow vs. steer/heifer):
##
## Call:
## glm(formula = Reprod ~ Weight + Price + Quantity, family = "binomial",
## data = lajunta_logit3_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.5231 0.0004 0.0265 0.0790 1.6321
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.2393515 1.0610136 8.708 <2e-16 ***
## Weight -0.0115410 0.0007183 -16.067 <2e-16 ***
## Price 0.0038388 0.0043474 0.883 0.377
## Quantity 0.2000384 0.0149191 13.408 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13668.3 on 13407 degrees of freedom
## Residual deviance: 1214.3 on 13404 degrees of freedom
## AIC: 1222.3
##
## Number of Fisher Scoring iterations: 9
The predictors in the first regression did a pretty good job, but judging by the p-values, Quantity in the second regression and Price in the third did an abysmal job!
The correct interpretation of the Price variable in the first regression is that a one dollar increase in the price of the livestock is associated with an average increase of 0.0782118 in the log odds of being a steer.
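If log odds feel too abstract, exponentiating the coefficient gives an odds ratio (a quick worked example using the estimate reported above):

```r
# A one dollar price increase multiplies the odds of
# being a steer by exp(beta)
price_coef <- coef(logit1)[["Price"]]
exp(price_coef)  # exp(0.0782118) is roughly 1.081,
                 # i.e. about an 8.1% increase in the odds per dollar
```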
Well, this is quite strange. Let’s continue on to the evaluation and see if we can figure out what’s happening.
We can use the pR2() function from the pscl package to compute McFadden’s R-squared value, which ranges from 0 to just under 1. Values close to 0 indicate the model has no predictive power. Values over 0.4 suggest the model is a good fit for the data.
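Only the statistic is echoed for each regression, but the call producing it would look something like this (pR2() returns several fit measures; McFadden's is one of them):

```r
# McFadden's pseudo R-squared for the first regression
pscl::pR2(logit1)[["McFadden"]]
```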
Let’s begin with the first regression:
## fitting null model for pseudo-r2
## 0.2238441
As expected, the predictors in the first regression do a pretty good job. Let’s see the McFadden statistic from the second regression:
## fitting null model for pseudo-r2
## 0.8118127
Ah! I see now. The model acts strangely because it is deterministic! I expect this to be the case for the third regression as well:
## fitting null model for pseudo-r2
## 0.9111578
Yup.
We can compute the importance of each predictor variable in the model by using the varImp() function from the caret package. Higher values indicate more importance. These values can help confirm the validity of p-values.
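The importance tables that follow would come from calls like this one (for glm objects, varImp() essentially reports the absolute z-statistics):

```r
# Variable importance for the first regression;
# swap in logit2 or logit3 for the others
caret::varImp(logit1)
```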
The first regression:
| Overall | |
|---|---|
| Weight | 38.47022 |
| Price | 45.59010 |
The values are both pretty big, so both of the predictors seem to be ‘carrying their weight.’ This is consistent with our analysis thus far.
What about the second regression?
| Overall | |
|---|---|
| Weight | 21.409533 |
| Price | 22.033420 |
| Quantity | 1.931391 |
That is certainly consistent with the p-values. The third regression?
| Overall | |
|---|---|
| Weight | 16.0665342 |
| Price | 0.8830064 |
| Quantity | 13.4081981 |
Price’s importance is quite low–consistent with its p-value–while Weight and Quantity do most of the work.
We can calculate the VIF of each variable to see if multicollinearity is a problem. VIF scores greater than 5 indicate multicollinearity is an issue that needs to be addressed. Since there are so few variables, multicollinearity is unlikely to have much effect on the results.
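The scores shown for each regression come from car::vif() (car is loaded above; only the output was echoed):

```r
# Variance inflation factors for the first regression;
# swap in logit2 or logit3 for the others
car::vif(logit1)
```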
The first regression:
## Weight Price
## 2.104414 2.104414
What a relief! Multicollinearity, as measured by VIF, is not a big problem in the data.
The second regression:
## Weight Price Quantity
## 1.523993 1.434929 1.092293
These scores are all well below five, so multicollinearity does not look like a problem in the second regression either. Not shown here, but I had already tried transforming the variables to no avail.
The third regression:
## Weight Price Quantity
## 3.892188 3.829056 1.078470
Good. These scores are not very worrying to me.
Finally, we can evaluate model performance by using the test dataset. By default, any observation in the test dataset with a probability greater than 0.5 will be predicted to be a steer, cow, or heifer/steer (a “y”), respectively (for regressions 1, 2, and 3). We can calculate these probabilities with this code:
# REGRESSION 1 (steer vs heifer)
# calculating the probability of being a steer
# and storing the results in a vector
logit1_test_probs <- predict(logit1,
lajunta_logit1_test,
type = "response")
# REGRESSION 2 (bull vs cow)
# calculating the probability of being a cow
# and storing the results in a vector
logit2_test_probs <- predict(logit2,
lajunta_logit2_test,
type = "response")
# REGRESSION 3 (bull/cow vs. steer/heifer)
# calculating the probability of being a
# heifer/steer (a "y") and storing the
# results in a vector
logit3_test_probs <- predict(logit3,
lajunta_logit3_test,
type = "response")
Thankfully, mathematical wizards around the world have come together to develop a magical way of determining the probability threshold that maximizes the model’s accuracy. The optimalCutoff() function from the InformationValue package finds this value.
Since regressions 2 and 3 are deterministic, this process will only be done on the first regression. One problem with this approach, I suppose you could argue, is that it risks overfitting the model. You can decide for yourself what you think.
# converting "str" to 1's and "hfr" to 0's
# This is needed for the optimalCutoff() call
# below
lajunta_logit1_test$Reprod <- ifelse(lajunta_logit1_test$Reprod == "str", 1, 0)
# finding optimal cutoff probability to use to maximize accuracy
logit1_optimal <- optimalCutoff(lajunta_logit1_test$Reprod,
logit1_test_probs)[1]
# print the optimal probability
cat(logit1_optimal)
## 0.53
The optimal probability is 0.53. With this new threshold (the old threshold was 0.5, remember), any livestock with a probability of being a steer at or above 0.53 will be classified as a steer, with the rest being classified as heifers.
Using the optimal threshold, we can create a confusion matrix showing how the model performed on the test data.
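The matrices that follow can be produced by thresholding the predicted probabilities and tabulating them against the truth (a sketch for the first regression; the prediction-vector name is my own):

```r
# Classify as steer (1) when the probability clears the cutoff,
# then cross-tabulate against the recoded test labels
logit1_test_preds <- ifelse(logit1_test_probs >= logit1_optimal, 1, 0)
table(logit1_test_preds, lajunta_logit1_test$Reprod)
```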
As always, let’s begin with the first regression:
## hfr str
## 0 1107 397
## 1 516 1524
The zeroes are heifers and the ones are steers. As we can see, 516 heifers were incorrectly identified as steers and 397 steers were incorrectly identified as heifers. Cool!
Let’s repeat the process for the second regression:
## bull cow
## 0 290 5
## 1 22 607
The model is nearly deterministic–only 27 misclassifications out of 924 test observations!
I expect the same for the third regression:
## par y
## 0 908 5
## 1 16 3540
Nearly deterministic indeed–just 21 misclassifications out of 4,469.
We can also calculate the sensitivity (the “true positive rate”), specificity (the “true negative rate”), and the total number of misclassifications using the numbers from the confusion matrix. I’ll spare you the details on the sensitivity and specificity and cut straight to the error rate. Further, since we already know regressions 2 and 3 are practically deterministic, I won’t waste anyone’s precious time showing that the error rate is essentially 0 percent in these two models.
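InformationValue ships a helper for the error rate shown next (a sketch; equivalently, it is just the share of thresholded predictions that disagree with the truth):

```r
# Misclassification rate on the test set at the optimal cutoff
InformationValue::misClassError(lajunta_logit1_test$Reprod,
                                logit1_test_probs,
                                threshold = logit1_optimal)
```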
The misclassification error rate for the first regression at the optimal threshold:
## 0.2554
The data were classified incorrectly on the test data 25.54 percent of the time.
Finally, we can put together ROC curves to visualize how well the models fit. The greater the AUC (area under the curve), the more accurately the model predicts.
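The curves themselves would come from something like the following (plotROC() also prints the AUC on the chart):

```r
# ROC curve for the first regression on the test set
InformationValue::plotROC(lajunta_logit1_test$Reprod, logit1_test_probs)
```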
For the first regression:
The first regression seems to be a reasonably good fit, as measured by the ROC curve.
The second curve? That’s just a gigantic blue rectangle. The same is true of the ROC curve for the third regression. My intuition at the outset of this section was correct: there was essentially no error in the third regression.
While we are in the neighborhood of classification models, why don’t we give LDA a go? As we all know by this point, there are four categories of interest: bull, cow, steer, and heifer. Also, as you may have guessed, the predictors are price, weight, and quantity. Date is a strange variable, and Type has too little predictive power to bother including.
The engine for the LDA model is the one provided in the MASS package. As usual, the sampling will be done with the rsample package–in the same manner as the previous classifications.
Since one of the key assumptions of LDA is that each predictor has the same variance, they should all be scaled to give the data a mean of zero and a standard deviation of one. We can do this with a mutate(across()) call:
lajunta_lda <- lajunta %>%
mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))
And we can confirm that the standard deviation is, indeed, 1 after transformation:
lajunta_lda %>%
summarize(across(where(is.numeric), sd)) %>%
unlist()
## Quantity Weight Price
## 1 1 1
Excellent. We are now ready to separate into training and test samples.
Due to imbalances in the number of observations in reproductive status of the livestock, stratified sampling will (again) be used. Okay, let’s make the 80/20 split:
# Setting the random number stream
set.seed(2021)
lajunta_lda_split <- rsample::initial_split(lajunta_lda,
prop = 0.80,
strata = Reprod)
# separating the data into two new data frames
lajunta_lda_train <- rsample::training(lajunta_lda_split)
lajunta_lda_test <- rsample::testing(lajunta_lda_split)
Now, let’s fit the model and see the results:
lajunta_lda_fit <- MASS::lda(Reprod ~ Price + Weight + Quantity,
data = lajunta_lda_train)
lajunta_lda_fit
## Call:
## lda(Reprod ~ Price + Weight + Quantity, data = lajunta_lda_train)
##
## Prior probabilities of groups:
## bull cow hfr str
## 0.07040048 0.13647550 0.36371094 0.42941308
##
## Group means:
## Price Weight Quantity
## bull -1.0370508 2.4503740 -0.7931096
## cow -1.6965971 1.3233879 -0.7089602
## hfr 0.1922095 -0.4763443 0.1466243
## str 0.5600227 -0.4235048 0.2288588
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## Price 0.1080869 1.6683022 -0.2548902
## Weight -2.2507730 1.3457412 0.2234531
## Quantity 0.3504477 0.1513684 1.0294523
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9212 0.0788 0.0000
Let’s break this output down bit by bit…
These represent the proportions of each reproductive status of the livestock in the training set. Bulls, for example, make up 7 percent of the observations in the training dataset.
The group means are the mean values for each predictor in the model, broken down by the possible outcomes for the dependent variable.
These coefficients represent the linear combination of predictor variables that are used to form the decision rule of the model. For example,
LD1 = (0.1080869)(Price) - (2.250773)(Weight) + (0.3504477)(Quantity)
The proportion of trace displays the percentage of separation achieved by each linear discriminant function (e.g., if LD1 were 0.73, we would say that “LD1 is responsible for 73% of the separation between the groups”).
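If you want those pieces programmatically rather than scraped from the printout, they live on the fit object (a quick sketch):

```r
lajunta_lda_fit$prior    # prior probabilities of groups
lajunta_lda_fit$means    # group means
lajunta_lda_fit$scaling  # coefficients of linear discriminants
# proportion of trace, from the singular values:
lajunta_lda_fit$svd^2 / sum(lajunta_lda_fit$svd^2)
```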
Following our prepare-split-train process, we have now come to the step where we analyze the test data.
lajunta_lda_probs <- predict(lajunta_lda_fit,
lajunta_lda_test)
The prediction returns a list with three elements: the predicted class, the posterior probability that an observation belongs to each class (the probability of the observation being a steer, heifer, bull, or cow–with these four probabilities summing to 1), and the linear discriminants.
Let’s see the first few posterior probabilities in the test set:
| bull | cow | hfr | str |
|---|---|---|---|
| 0 | 0.00e+00 | 0.1861780 | 0.8138220 |
| 0 | 0.00e+00 | 0.3090660 | 0.6909339 |
| 0 | 0.00e+00 | 0.1902743 | 0.8097256 |
| 0 | 0.00e+00 | 0.1983542 | 0.8016457 |
| 0 | 0.00e+00 | 0.2050239 | 0.7949760 |
| 0 | 2.00e-07 | 0.2528935 | 0.7471063 |
| 0 | 1.50e-06 | 0.3687323 | 0.6312662 |
| 0 | 2.45e-05 | 0.3419699 | 0.6580056 |
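The table above can be pulled straight from the posterior element of the prediction (the kable formatting is an assumption):

```r
# First few posterior class probabilities; each row sums to 1
knitr::kable(head(lajunta_lda_probs$posterior, 8))
```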
I imagine that I am losing you by this point, because these classification problems all approximate the same conclusion–the one the graph in the introduction shows plain as day. I think it is good to use several approaches to confirm results, however, because that only strengthens our understanding of the interaction between the variables.
With that being said, observation of the above table once again confirms that the difference between heifers and steers is far more ambiguous than differences between any other cattle reproductive statuses.
The question I think you are likely most interested in–“Does LDA perform better than KNN or any of the logistic regressions?”–can be answered easily by comparing the model’s classifications of the test data to the actual observations in the test set.
# For you non-R programmers out there,
# R translates TRUE's into 1's and FALSE's
# into 0's in mathematical functions, so
# mean() of a logical vector returns the
# proportion of TRUE's.
mean(lajunta_lda_probs$class == lajunta_lda_test$Reprod)
Drum roll please! The model correctly identified the true reproductive status 77 percent of the time! The two methods are reasonably close to one another, but KNN remains the better predictor if we want to classify reproductive status in the future.
Oftentimes, the people we need to convey insights to have no background in statistics. Occam’s Razor means something approaching “the simpler explanation is usually the most correct.” Chances are that we will have a much easier time explaining KNN than we would describing linear discriminant analysis. If a simpler method does a “good-enough” job, choose the simpler method.
Finally, we can visualize how well LDA separated the different reproductive statuses of cattle:
We could make similar plots using LD3 as well, but LD1 and LD2 are by far the most important linear discriminants. There is no real justification for including LD3.
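A sketch of such a plot, projecting the test data onto the first two discriminants (the aesthetics are assumptions):

```r
# LD1 vs. LD2 scores for the test set, colored by true status
lda_scores <- as.data.frame(lajunta_lda_probs$x)
lda_scores$Reprod <- lajunta_lda_test$Reprod
ggplot(lda_scores, aes(x = LD1, y = LD2, color = Reprod)) +
  geom_point(alpha = 0.4) +
  theme_dark()
```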
For those of you looking to improve this model, try transforming some of the predictors!
I am in the mood to keep going with these classification models. I want to get a model that predicts with 95 percent accuracy, but we haven’t met that threshold with any of the preceding models. This model is similar to the logistic regression model, but instead of filtering data down to just steers and heifers, it includes cows and bulls as well. The predictors include Weight, Price, and Quantity.
The question this model answers: “is the livestock a steer?”
The e1071 package will be the engine used to implement the algorithm. The plotly package was used to create the diagram in the next section.
The algorithm is supposed to classify observations by creating a hyperplane with the maximum possible margins (in a 2D setting, you can think of the hyperplane as a line and the margins as the “buffer zone” between the points and the line–we want the largest “buffer zone” possible). Even in three dimensions, values for steers and heifers overlap often. To remedy this, a kernel function will be used to transform the data into a higher dimension. You can examine the existing overlap yourself using the interactive diagram below (HTML file only).
Character variables will be transformed into factors and the logical column, “is_Steer,” will be added to the data frame. Note that rows with remaining missing values are filtered out.
# Note: remember that missing values for Price and Weight
# were filtered out, and missing values for Reprod
# were imputed using K-nearest neighbors.
lajunta_svm <- lajunta %>%
mutate(across(where(is.character), as.factor),
is_Steer = ifelse(Reprod == "str", TRUE, FALSE)) %>%
filter(!is.na(Date),
!is.na(Buyer),
!is.na(Quantity),
!is.na(Type),
!is.na(is_Steer))
Alright, let’s make the split:
# Setting the random number stream
set.seed(2021)
# making the stratified 80/20 split
lajunta_svm_split <- initial_split(lajunta_svm,
prop = 0.80,
strata = Reprod)
# separating the data into two new data frames
lajunta_svm_train <- training(lajunta_svm_split)
lajunta_svm_test <- testing(lajunta_svm_split)
Several kernel functions will be tried to find the best fit. Radial basis functions are usually the go-to for this type of problem, so that is the one I am most interested in.
# linear
lajunta_svm_classifier <- svm(is_Steer ~ Price + Weight + Quantity,
                              data = lajunta_svm_train,
                              type = "C-classification", kernel = "linear")
# quadratic
lajunta_svm_classifier2 <- svm(is_Steer ~ Price + Weight + Quantity,
                               data = lajunta_svm_train,
                               type = "C-classification", kernel = "polynomial",
                               degree = 2)
# cubic
lajunta_svm_classifier3 <- svm(is_Steer ~ Price + Weight + Quantity,
                               data = lajunta_svm_train,
                               type = "C-classification", kernel = "polynomial",
                               degree = 3)
# sigmoid
lajunta_svm_classifier4 <- svm(is_Steer ~ Price + Weight + Quantity,
                               data = lajunta_svm_train,
                               type = "C-classification", kernel = "sigmoid")
# radial
lajunta_svm_classifier5 <- svm(is_Steer ~ Price + Weight + Quantity,
                               data = lajunta_svm_train,
                               type = "C-classification", kernel = "radial")
On to the fun part!
As usual, we now check each model's performance using the test dataset.
# linear
lajunta_svm_predict <- predict(lajunta_svm_classifier,
                               newdata = lajunta_svm_test %>%
                                 dplyr::select(Weight, Price, Quantity, is_Steer))
# quadratic
lajunta_svm_predict2 <- predict(lajunta_svm_classifier2,
                                newdata = lajunta_svm_test %>%
                                  dplyr::select(Weight, Price, Quantity, is_Steer))
# cubic
lajunta_svm_predict3 <- predict(lajunta_svm_classifier3,
                                newdata = lajunta_svm_test %>%
                                  dplyr::select(Weight, Price, Quantity, is_Steer))
# sigmoid
lajunta_svm_predict4 <- predict(lajunta_svm_classifier4,
                                newdata = lajunta_svm_test %>%
                                  dplyr::select(Weight, Price, Quantity, is_Steer))
# radial
lajunta_svm_predict5 <- predict(lajunta_svm_classifier5,
                                newdata = lajunta_svm_test %>%
                                  dplyr::select(Weight, Price, Quantity, is_Steer))
Now, let’s see how each classifier performed with a confusion matrix:

Linear kernel:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 2065 | 475 |
| 1 | 661 | 1267 |

Quadratic kernel:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 1720 | 820 |
| 1 | 324 | 1604 |

Cubic kernel:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 1543 | 997 |
| 1 | 218 | 1710 |

Sigmoid kernel:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 1687 | 853 |
| 1 | 889 | 1039 |

Radial kernel:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 2163 | 377 |
| 1 | 509 | 1419 |
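For reference, accuracy falls straight out of any of these tables: correct predictions sit on the diagonal. A quick sketch, rebuilding the radial kernel's matrix from the counts reported above:

```r
# The radial kernel's confusion matrix, re-entered by hand from the
# counts above (rows 0/1, columns FALSE/TRUE, as printed).
radial_confusion <- matrix(c(2163, 509, 377, 1419),
                           nrow = 2,
                           dimnames = list(c("0", "1"), c("FALSE", "TRUE")))

# Accuracy is the share of observations on the diagonal;
# the misclassification error rate is its complement.
accuracy <- sum(diag(radial_confusion)) / sum(radial_confusion)
round(accuracy, 3)   # about 0.802
error_rate <- 1 - accuracy
```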
Using a linear kernel function, the model's specificity and sensitivity are very similar, and its misclassification error rate is comparable to those of the quadratic and cubic polynomial kernels. The polynomial kernels, however, make lopsided errors: each concentrates its mistakes in one off-diagonal cell of its confusion matrix, and the cubic kernel is even more lopsided than the quadratic.
In other contexts, we might want to be conservative about the number of false positives we get–even at the expense of the overall misclassification error rate. If this were banking data and we wanted to predict whether someone will default on a loan, for instance, we might permit many false positives (predicting default for people who would have repaid)–denying those people access to credit–if it means avoiding a false negative (in which the firm loses loads of money on someone who defaults, despite the model predicting the individual would not). These considerations are important, but my goal is to minimize the overall misclassification error rate. Nobody is losing their business because a model predicts livestock of a given weight and price to be a steer when it is really a heifer!
The sigmoid kernel was the least accurate: both sensitivity and specificity are low, and at roughly 61 percent accuracy it performs only a little better than random guessing. This was the weakest model yet.
The radial kernel is the most accurate, with an accuracy of 80.2 percent (equivalently, a misclassification error rate of 19.8 percent). Not quite the 95 percent we are looking for–and it performs similarly to KNN classification–so I think we have reached a plateau in predictive power using just three variables.
Let’s try adding variations to the model with a radial kernel function.
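Each variant is a one-line change to the earlier radial call: just extend the formula. A minimal sketch, where toy_train is a hypothetical stand-in data frame with the same column types (in the document proper these models are fit on lajunta_svm_train, and the Type labels here are made up):

```r
library(e1071)

# Hypothetical stand-in for the real training split, only so the
# three variant formulas below can be demonstrated end to end.
set.seed(2021)
toy_train <- data.frame(
  Price    = runif(300, 50, 200),
  Weight   = runif(300, 300, 900),
  Quantity = sample(1:20, 300, replace = TRUE),
  Date     = as.Date("2021-01-01") + sample(0:90, 300, replace = TRUE),
  Type     = factor(sample(c("blk", "red", "mxd"), 300, replace = TRUE)),
  is_Steer = sample(c(TRUE, FALSE), 300, replace = TRUE)
)

# Variant 1: base predictors plus Date
variant_date <- svm(is_Steer ~ Price + Weight + Quantity + Date,
                    data = toy_train,
                    type = "C-classification", kernel = "radial")
# Variant 2: base predictors plus Type
variant_type <- svm(is_Steer ~ Price + Weight + Quantity + Type,
                    data = toy_train,
                    type = "C-classification", kernel = "radial")
# Variant 3: base predictors plus both Date and Type
variant_both <- svm(is_Steer ~ Price + Weight + Quantity + Date + Type,
                    data = toy_train,
                    type = "C-classification", kernel = "radial")
```

Dates work directly in the formula because model.matrix() treats them as their underlying numeric values; Type enters as dummy variables for each factor level.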
There are three new variants of the model, each keeping Weight, Price, and Quantity as predictors: one adding the Date variable, one adding the Type variable, and one adding both Date and Type. You can see their confusion matrices below:

Radial kernel, Date added:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 2192 | 348 |
| 1 | 341 | 1587 |

Radial kernel, Type added:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 2119 | 421 |
| 1 | 438 | 1490 |

Radial kernel, Date and Type added:

|   | FALSE | TRUE |
|---|---|---|
| 0 | 2155 | 385 |
| 1 | 387 | 1541 |
It would appear that the Type column increases misclassification error. That's too bad–with no prior knowledge, I really thought that the type of cattle would carry predictive value. I spent a lot of time organizing that column. All that work was not in vain, though: without doing it, I would never have been able to show that the type of cattle has little bearing on these predictions.
The accuracy of the model that added only Date is 84.6 percent, a solid improvement over its counterpart using only weight, price, and quantity (80.2 percent). This is the most accurate model yet, and it moves us meaningfully closer to the 95 percent target.
If you’ve made it this far–you’re a trooper! I hope you enjoyed going through this as much as I did! It was written over the course of four days. If you have any questions or want to provide feedback, do not hesitate to reach out. Take care now!